This walk-through goes through the analysis of a BAMM run on our phylogenetic tree, which is short for Bayesian Analysis of Macroevolutionary Mixtures. This model identifies changes in speciation, extinction, and diversification rates on a tree without any knowledge of tip states or traits, and identifies the most likely number of transitions, where they are likely to occur, and outputs a bunch of other useful things.
From the online documentation of BAMM, we are following Section 8: Analysing BAMM Output with BAMMtools. We have previously ran bamm on our tree. Input information about that run here.
Results
This uses the ultrametric tree as re-estimated using a strict molecular clock model using phangorn.
Load in R packages
First we will load in R packages used and the metadata file used and wrangled in a previous walk-through.
Code
set.seed(42)# load packageslibrary(here)library(caper)library(ggtree)library(ggnewscale)library(RColorBrewer)library(patchwork)library(ape)library(phytools)library(BAMMtools)library(coda)library(MetBrewer)library(fastdivrate) # remotes::install_github("jonchang/fastdivrate")library(nlme)library(emmeans)library(tidyverse)# set where I am in the projecthere::i_am('scripts/sequencing_rpoB/analyses/phylogenetics/post_bamm_analysis.qmd')# read in habitat preferenced_habpref <-read.csv(here('data/sequencing_rpoB/phyloseq/myxococcus/habitat_preference/summary/habitat_preference_asv.csv'))# read in phyloseq object and grab tax tabled_taxa <-readRDS(here('data/sequencing_rpoB/phyloseq/myxococcus/prevalence_filtered/ps_otu_asv_filt.rds')) %>% phyloseq::tax_table() %>%data.frame() %>% janitor::clean_names() %>%rownames_to_column('otu')# create d_metad_meta <-left_join(select(d_habpref, otu, habitat_preference, num_present), select(d_taxa, otu:family))
Lineage through time plot
First we will look at a lineage through time plot of our ultrametric phylogenetic tree.
The times on the tree are not scaled between 0-1 as in the chronopl() tree.
Code
# load in treetree <- ape::read.tree(here('data/sequencing_rpoB/raxml/trees/myxo_asv/myxo_asv_phangorn.tre'))# check is rootedis.rooted(tree)
[1] TRUE
Code
# check is ultrametricis.ultrametric(tree)
[1] TRUE
Code
# grab lineage through time dataape::ltt.plot(tree)
Code
# create function to scale data between 0 and 1scale_01 <-function(x){(x-min(x))/(max(x)-min(x))}# create data framed_ltt <- ape::ltt.plot.coords(tree) %>%data.frame() %>%mutate(time2 =scale_01(time))# create lineage through time plotggplot(d_ltt, aes(time2, N)) +geom_line() +theme_bw(base_size =14) +labs(x ='Relative time',y ='Number of lineages')
Assess convergence of BAMM run
First we will assess convergence of our MCMC simulation.
Code
# read in mcmc output from bammmcmcout <-read.csv(here("data/sequencing_rpoB/bamm/bamm_asv_phangorn_mcmc_out.txt"), header=TRUE)max(mcmcout$generation)
[1] 24360000
Code
# discard some runs as burnin. We will discard the first 10% of samplesburnstart <-floor(0.3*nrow(mcmcout))postburn <- mcmcout[burnstart:nrow(mcmcout), ]# calculate effective sample sizeeffectiveSize(postburn$N_shifts)
var1
153.7014
Code
effectiveSize(postburn$logLik)
var1
138.6968
In general, we want the effective sample size values to be at least 200 (and 200 is on the low side, but might be reasonable for very large datasets). Both are now well above 100 with a 30% burnin. We can see that the run did not quite finish (on 25,000,000) iterations not 30,000,000, but the results should still be interpretable from experience.
Next we can look at the number of potential rate shifts.
Reading event datafile: /home/padpadpadpad/Desktop/myxo_diversification/data/sequencing_rpoB/bamm/bamm_asv_phangorn_event_data.txt
...........
Read a total of 1219 samples from posterior
Discarded as burnin: GENERATIONS < 7280000
Analyzing 855 samples from posterior
Setting recursive sequence on tree...
Done with recursive sequence
Code
shift_probs <-summary(edata)
Analyzed 855 posterior samples
Shift posterior distribution:
... omitted 18 rows
51 0.022
52 0.023
53 0.020
54 0.028
55 0.030
56 0.032
57 0.028
58 0.032
59 0.030
60 0.027
61 0.026
... omitted 44 rows
Compute credible set of shift configurations for more information:
See ?credibleShiftSet and ?getBestShiftConfiguration
Code
# plot theseggplot(shift_probs, aes(shifts, prob)) +geom_col(col ='black', fill ='light grey') +theme_bw(base_size =14) +labs(x ='Number of shifts',y ='Probability')
Code
# calculate 95% CIs for the number of shiftsn_shifts_ci <-tibble(mean_shifts =mean(postburn$N_shifts),lower_ci =quantile(postburn$N_shifts, 0.025),upper_ci =quantile(postburn$N_shifts, 0.975))n_shifts_ci
The model has converged, but it find loads more shifts, anything from 41 to 96, with an average of 65!.
The maintainers or BAMM suggest that (usually) the best overall model from a BAMM analysis is the model with the highest Bayes factor relative to the null model, \(M_{0}\), which has zero rate shifts. However, we do not have any samples of zero shifts in our postburn in sample, instead the minimum number of shifts is 30! However, we do have zero shifts in our preburn-in samples as can be seen here: 0.
We can therefore calculate Bayes factors from the mcmc_out.txt file. We are not going to have a burnin because otherwise we cannot sample the example of zero shifts.
Bayes factors greater than 20 generally imply strong evidence for one model over another; values greater than 50 are very strong evidence in favour of the numerator model. There is no definitive Bayes factor criterion for “significance”, but many researchers consider values greater than 12 to be consistent with at least some effect.
Code
# list filemcmc_file =here("data/sequencing_rpoB/bamm/bamm_asv_phangorn_mcmc_out.txt")# calculate Bayes Factorsbayes_factors <-computeBayesFactors(mcmc_file, expectedNumberOfShifts=500, burnin=0)# grab the columns for pairwise comparisons between 0 shifts and number of shiftsd_bayes_factors <- bayes_factors[,1] %>%data.frame() %>%rownames_to_column(var ='n_shifts') %>%rename(., bayes_factor =`.`)# we can rank bayes factors and then find the the difference between thesed_bayes_factors <-arrange(d_bayes_factors, desc(bayes_factor)) %>%mutate(diff =c(0, abs(diff(bayes_factor))),cum_diff =cumsum(diff))head(d_bayes_factors)
We can see the best model with 59 shifts is ranked the best but there are lots of other models that have similar levels of support. We can look at the number of shifts predicted from models within 20 of the best model with the best bayes factor.
Min. 1st Qu. Median Mean 3rd Qu. Max.
48.00 55.25 61.50 62.31 68.75 79.00
There are 26 models within 20 bayes factors, with an average number of shifts of 62, a min of 48 and mac of 79.
BAMMtools also has a function for visualising the prior and posterior simultaneously. This is useful to see what models are not being sampled in the posterior, and also to evaluate how far from the prior the posterior has moved.
Code
# use plotPrior to visualise the prior and posterior simultaneouslyd_prior <-plotPrior(mcmcout, expectedNumberOfShifts=500, burnin =0.3) %>%data.frame() %>% janitor::clean_names() %>%pivot_longer(cols =contains('probs'), names_to ='type', values_to ='prob', names_pattern ="(.*)_probs")
Code
ggplot(d_prior, aes(n_shifts, prob, fill = type)) +geom_bar(col ='black', stat ='identity', position=position_dodge()) +theme_bw(base_size =14) +scale_fill_manual('', values =c('grey', 'black'), labels =c('posterior', 'prior')) +labs(x ='Number of shifts',y ='Probability') +theme(legend.position =c(0.2, 0.8))
We can see that our posterior distribution has shifted from the prior which further reinforces our conclusion that the model has converged.
Plotting results from bamm model
We can now try and plot the mean, model-average diversification rates at any point along every branch of a phylogenetic tree. The standard code to do this takes so long to run that right we will not evaluate the code.
Code
plot.bammdata(edata, lwd=2)
We can calculate the credible shift set that is the set of distinct shift configurations that account for 95% of the probability of the data. Core shifts are those that contribute appreciably to your ability to model the data. Non-core shifts are simply ephemeral shifts that don’t really contribute anything: they are simply what you expect under the prior distribution for rate shifts across the tree.
Code
# calculate credible shift setd_css <-credibleShiftSet(edata, expectedNumberOfShifts =500, threshold =5, set.limit =0.95)# number of distinct configurations in the datad_css$number.distinct
[1] 800
Code
# view more information about the credible setsummary(d_css)
95 % credible set of rate shift configurations sampled with BAMM
Distinct shift configurations in credible set: 800
Frequency of 9 shift configurations with highest posterior probability:
rank probability cumulative Core_shifts
1 0.002339181 0.002339181 14
2 0.002339181 0.004678363 14
3 0.002339181 0.007017544 13
4 0.002339181 0.009356725 14
5 0.002339181 0.011695906 15
6 0.002339181 0.014035088 15
7 0.002339181 0.016374269 14
8 0.002339181 0.018713450 13
9 0.002339181 0.021052632 12
...omitted 791 additional distinct shift configurations
from the credible set. You can access the full set from your
credibleshiftset object
Code
# can plot this credible shift setplot.credibleshiftset(d_css)
Omitted 791 plots
However, as can be seen from the summary, even the single best shift configuration has a really low posterior probability (0.0023). Although the best model supports 60 shifts, not all of these are core shifts. There are only 14 core shifts identified in the best model shift configurations here.
For some datasets with large numbers of taxa and rate shifts (e.g., trees with thousands of taxa), all shift configurations may have low probability. There are simply too many parameters in the model to allow a single shift configuration to dominate the credible set. An alternative approach is to extract the shift configuration that maximises the marginal probability of rate shifts along individual branches. This is very similar to the idea of a maximum clade credibility tree in phylogenetic analysis. BAMM has a function maximumShiftCredibility for extracting this shift configuration:
Code
# calculate max shift credibilitymsc_set <-maximumShiftCredibility(edata, maximize='product')# grab the best configuration and plot itmsc_config <-subsetEventData(edata, index = msc_set$sampleindex)plot.bammdata(msc_config, lwd=2)addBAMMshifts(msc_config, cex =2)
This picks a model with 30 rate shifts over the whole tree.
We can try and plot these shifts using ggtree. First we will look at the distribution of speciation rates across the tree. We need to be careful about our colour scale to prevent it being misleading.
Here we also change the tip labels of our tree so they rematch with those from our metadata.
Code
# get mean phylorates that underly the colorised plot produced by plot.bammdata# from here: https://groups.google.com/g/bamm-project/c/W6s38xzm6OU/m/LALF47xVS54J#mbt <- getMeanBranchLengthTree(edata, rate = "speciation")# get the mean branch lengths from the best tree configuration as identified from maximumShiftCredibility mbt2 <-getMeanBranchLengthTree(msc_config, rate ="ndr")# get shift nodes from "best model"shiftnodes <-getShiftNodesFromIndex(edata, index = msc_set$sampleindex)# get treetree_bamm <- mbt2$phy# get the edge lengths in a dataframed_tree_bamm <-data.frame(tree_bamm$edge, edge_num=1:nrow(tree_bamm$edge), edge_length = tree_bamm$edge.length)colnames(d_tree_bamm)=c("parent", "node", "edge_num", 'edge_length')# transform these to log for the colour scaled_tree_bamm <-mutate(d_tree_bamm, log_edge_length =log(edge_length))# visualise edge lengths that represent speciation rates of each branch# similar to visualising colour breaks http://bamm-project.org/colorbreaks.html#how-do-i-plot-these-histogramsp1 <-ggplot(d_tree_bamm, aes(edge_length)) +geom_histogram(col ='black', fill ='light grey') +theme_bw()p2 <-ggplot(d_tree_bamm, aes(log(edge_length))) +geom_histogram(col ='black', fill ='light grey') +theme_bw()p1 + p2
We will use the log edge lengths/rates of diversification because they result in a more even spread of diversification rates across a range of values. However it is obviously extremely important to realise this emphasises rate variation in colour changes.
Now we can plot the tree. We will try plot the family assignment around the edge of the tree. We also need to assign the colours the same as before.
Code
# constrained familiesconstrained_families <-c('Myxococcaceae', 'Vulgatibacteraceae', 'Anaeromyxobacteraceae', 'Polyangiaceae', 'Sandaracinaceae', 'Nannocystaceae', 'Haliangiaceae')# reorder d_meta so the tip labesl link to the order of the tips in the treed_meta <-tibble(tip_label = tree_bamm$tip.label) %>%left_join(., rename(d_meta, tip_label = otu))# find the mrca of each of the constrained familiesd_meta2 <-filter(d_meta, family %in% constrained_families) %>% dplyr::select(family, tip_label) %>%group_by(family) %>%nest() %>%mutate(mrca =NA)for(i in1:nrow(d_meta2)){ d_meta2$mrca[i] <-findMRCA(tree_bamm, tips = d_meta2$data[[i]]$tip_label)}d_meta2 <- dplyr::select(d_meta2, family2 = family, mrca) %>%mutate(blank_label ='')# add colour for the different familiescols <-c(colorRampPalette(brewer.pal(11, "Spectral"))(nrow(d_meta2)))names(cols) <-sort(d_meta2$family2)# define colours for the different habitatscols_hab <-met.brewer('Austria', n =7)names(cols_hab) <-c('marine_mud', 'freshwater', 'terrestrial', 'freshwater:terrestrial', 'generalist', 'marine_mud:terrestrial', 'freshwater:marine_mud')hab_labels <-c('marine mud', 'freshwater', 'terrestrial', 'freshwater + terrestrial', 'generalist', 'marine mud + terrestrial', 'freshwater + marine mud')# remove any tip labels not in our tree from d_metad_meta <-filter(d_meta, tip_label %in% tree_bamm$tip.label)# make a separate column for the rare states to make their size bigger!d_meta <-mutate(d_meta, rare =ifelse(habitat_preference %in%c('generalist', 'marine_mud:terrestrial'), 'rare', 'common'))# plot tree using ggtree# first colour branches and add rate shiftsp1 <-ggtree(tree_bamm, layout ='circular', branch.length ='none', aes(col = log_edge_length)) %<+% d_tree_bamm +scale_color_gradientn('Net diversification (branch colours)', colors =met.brewer(name='Hiroshige', direction=-1, override.order = F), breaks=c(min(d_tree_bamm$log_edge_length, na.rm =TRUE) +abs(min(d_tree_bamm$log_edge_length, na.rm =TRUE))*0.2, max(d_tree_bamm$log_edge_length, na.rm =TRUE) *0.95), labels=c("Slow","Fast")) +geom_point2(aes(subset=(node %in% shiftnodes)), color="black",size=5)# next add tip pointsp2 <- p1 %<+% d_meta +new_scale_color() +geom_tippoint(aes(x=x+5, col = habitat_preference, size = rare), position =position_jitter(width =3, height =0)) +scale_color_manual('Habitat preference (tip points)', values = cols_hab, labels = hab_labels) +scale_size_manual(values =c(0.6, 2)) +guides(color =guide_legend(override.aes =list(size =5)),size ='none')p2 +new_scale_colour() +geom_cladelab(data = d_meta2,mapping =aes(node = mrca,color = family2,label = blank_label),offset =10,barsize =2) +scale_color_manual('Family (outer bar)', values = cols) +guides(color =guide_legend(override.aes =list(size =5)))
We can look at diversification rates change through time to see if diversification in general is faster or slower deeper in the tree, and therefore further back in evolutionary time.
We can also grab the rates through time for all the sub-trees after a rate shift to see how they compare. To do this we need to write our own function to process the data and get it ready for plotting.
Code
# write function to get rate through time into the correct formatget_rate_through_time_df <-function(ephy, ...){ rtt <-getRateThroughTimeMatrix(ephy, ...)# get dataframe of each part# first speciation rtt_sp <- rtt$lambda %>%data.frame() %>%mutate(sample =1:n()) %>%pivot_longer(starts_with('X'), names_to ='time_point', values_to ='speciation') %>%mutate(time_point =parse_number(time_point)) %>%group_by(sample) %>%mutate(time =unname(rtt$times)) %>%ungroup()# second extinction rtt_ex <- rtt$mu %>%data.frame() %>%mutate(sample =1:n()) %>%pivot_longer(starts_with('X'), names_to ='time_point', values_to ='extinction') %>%mutate(time_point =parse_number(time_point)) %>%group_by(sample) %>%mutate(time =unname(rtt$times)) %>%ungroup() rtt_comb <-left_join(rtt_sp, rtt_ex) rtt_comb <-mutate(rtt_comb, net_div = speciation - extinction) %>%pivot_longer(cols =c(speciation, extinction, net_div), names_to ='process', values_to ='rate')return(rtt_comb)}# get rate through time estimates for the whole treertt_all <-get_rate_through_time_df(ephy = edata)# create meansrtt_combine_means <-group_by(rtt_all, time_point, process, time) %>%summarise(ave_rate =mean(rate), .groups ='drop',lower_ci =quantile(rate, 0.025),upper_ci =quantile(rate, 0.975))# get rate through time estimates for each shift nodertt_shift <-tibble(shift_node = shiftnodes,n =1:length(shiftnodes)) %>%nest(data = shift_node) %>%mutate(temp = purrr::map(data, possibly(~get_rate_through_time_df(ephy = edata, node = .x$shift_node, nodetype ='include'), otherwise =NA_real_)),is_tib = purrr::map_dbl(temp, is_tibble))rtt_shift_nowork <-filter(rtt_shift, is_tib ==0) %>% dplyr::select(data) %>%unnest(data)rtt_shift2 <-filter(rtt_shift, is_tib ==1) %>% dplyr::select(data, temp) %>%unnest(data) %>%unnest(temp)# create meansrtt_shift_means <-group_by(rtt_shift2, time_point, process, time, shift_node) %>%summarise(ave_rate =mean(rate), .groups ='drop',lower_ci =quantile(rate, 0.025),upper_ci =quantile(rate, 0.975))# create a plotggplot() +geom_line(aes(time, ave_rate, group = shift_node), rtt_shift_means, col ='dark grey') +geom_ribbon(aes(time, ymin = lower_ci, ymax = upper_ci, group = shift_node), alpha =0.01, rtt_shift_means, fill ='dark grey') +geom_line(aes(time, ave_rate), rtt_combine_means) +geom_ribbon(aes(time, ymin = lower_ci, ymax = upper_ci), alpha =0.1, rtt_combine_means) +facet_wrap(~process, scales ='free') +theme_bw(base_size =14) +labs(x ='Relative time',y ='rate')
Code
# save plot outggsave(here('plots/sequencing_rpoB/analyses/bamm_rate_through_time_phangorn.png'), last_plot(), height =5, width =12)
Compared to the penalized likelihood smoothing model, we can see very drastic declines in net diversification driven by rapid decreases in speciation. Generally the rate shifts are not as deep in tree and result in static speciation rate through time.
You can see that after a shift node there is generally a rapid acceleration in diversification and speciation rate, after which we see a decline, whereas extinction rate rises quickly and then plateaus through time. The uncertainty around these node specific estimates is very high, but the general pattern appears to hold.
Looking at tip-specific evolutionary rates
We can also estimate tip-specific evolutionary rates. We can plot these across our different traits to see if there is anything particularly interesting going on. This could be complemented by our models looking at state-dependent character diversification rates.
The first method is to grab the tip-specific rate from the BAMM model.
p1 <-ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), speciation)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =12) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='Speciation rate')p2 <-ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), extinction)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =12) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='Extinction rate')p3 <-ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), net_diversification)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =12) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='Net diversification rate')p1 + p2 + p3
Code
# save plot outggsave(here('plots/sequencing_rpoB/analyses/bamm_tip_rates_phangorn.png'), last_plot(), height =5, width =17)
There does not appear to be much variation between traits in their speciation and extinction rates. If we wanted to follow this further there are lots of papers we could look at. First and foremost would be to read Title & Rabosky’s paper entitled “Tip rates, phylogenies and diversification: What are we estimating, and how good are the estimates?”. They also share other papers that have used tip rate estimates to look at diversification rates across geographical and environmental gradients and across different traits.
A second method would be to calculate the DR statistic (also known as tip DR) which is an tip-level estimate of the speciation rate separate from the BAMM model. This measure is non-model based, and incorporates the number of splitting events and the internode distances along the root-to-tip path of a phylogeny, while giving greater weight to branches closer to the present. This was first implemented by Jetz et al. (2012) in their Nature paper about bird diversity in space and time.
Code
# compute tip DRd_tipdr <-DR_statistic_C(tree)# put this into a dataframed_tipdr <-data.frame(tip_label =names(d_tipdr),tip_dr =unname(d_tipdr)) %>%separate(., tip_label, c('part1', 'part2', 'part3'), sep ='_') %>%unite('tip_label', c(part1, part2), sep ='_') %>% dplyr::select(-part3) %>%left_join(., dplyr::select(d_meta, tip_label, habitat_preference)) %>%filter(!is.na(habitat_preference)) %>%group_by(habitat_preference) %>%mutate(n =n()) %>%ungroup() %>%mutate(hab_pref_axis =gsub(':', '/ ', habitat_preference),hab_pref_axis =gsub('_', ' ', hab_pref_axis))ggplot(d_tipdr, aes(forcats::fct_reorder(hab_pref_axis, n), tip_dr)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =14) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='tip DR (speciation rate)')
This method also shows very little variation in the estimated tip-level speciation rate between different habitat preferences.
This again is something to discuss further with Rutger as I am not sure exactly where to go with the remainder of this analysis. These sorts of statistics may be very useful where fitting of SSE models with rate changes within the state are not possible. This may be something to consider with our dataset where the number of parameters in the MuHiSSE would be so big.
We can can do phylogenetic generalised least squares regressions to account for phylogenteic relatedness concerning differences in diversification rate between species with different habitat preferences. This approach has been used by Jetz et al. (2012) in their Nature paper about bird diversity in space and time.
We will first do this using nlme::lme() as described in Liam Revell’s and Luke Harmon’s book “Phylogenetic Comparative Methods in R”.
Code
# set up correlation matrix for the treecor_lambda <-corPagel(value =1, phy = tree, form =~tip_label)# fit phylogenetic generalised linear modelmod <-gls(tip_dr ~ habitat_preference, data = d_tipdr, correlation = cor_lambda)summary(mod)# do contrasts between habitat preferencescontrasts <-emmeans(mod, pairwise ~ habitat_preference)contrasts$contrasts
So the model fits and is ok. The estimate of Pagel’s lambda - the phylogenetic signal - is 0.79 which is quite high. There are no significant differences between any of the habitat preferences in terms of their speciation rate though.
Defining each tip as either high diversification or low diversification
One of the main questions we are interested in is whether we see a increase in the diversification rate AFTER a transition, but this is a difficult thing to test using either the BAMM model or state-dependent speciation and extinction models.
One alternative might be to create a new variable of high and low diversification rate for each tip, and then combine this with habitat preference when we explore transition rate models.
We can look at the distribution of tip rates, irregardless of habitat preference.
Code
ggplot(tip_rates, aes(net_diversification)) +geom_histogram(fill ='white', col ='black') +theme_bw()
There is a hint that this model is bimodal. We can run a k-means clustering on it to try and split it into two groups.
Code
# do very crude kmeans clusteringmod_clust <-kmeans(tip_rates$net_diversification, 2, algo="Lloyd")# add this to our datasettip_rates <-mutate(tip_rates, cluster =as.character(mod_clust$cluster),div_rate =ifelse(cluster =='1', 'low', 'high'))ggplot(tip_rates, aes(net_diversification, fill = div_rate)) +geom_histogram(col ='black') +theme_bw(base_size =14) +theme(legend.position =c(0.8, 0.8))
We can count the number of each combination of diversification rate (high or low) and each habitat preference to see whether we have some common and rare states.
Code
# check number in each diversity rate bin: high and lowgroup_by(tip_rates, div_rate) %>%tally()
# A tibble: 2 × 2
div_rate n
<chr> <int>
1 high 2183
2 low 438
# A tibble: 13 × 4
habitat_preference div_rate n prop
<chr> <chr> <int> <dbl>
1 freshwater high 561 0.76
2 freshwater low 181 0.24
3 freshwater:marine_mud high 89 0.77
4 freshwater:marine_mud low 27 0.23
5 freshwater:marine_mud:terrestrial high 3 0.75
6 freshwater:marine_mud:terrestrial low 1 0.25
7 freshwater:terrestrial high 389 0.81
8 freshwater:terrestrial low 93 0.19
9 marine_mud high 531 0.94
10 marine_mud low 36 0.06
11 marine_mud:terrestrial high 6 1
12 terrestrial high 604 0.86
13 terrestrial low 100 0.14
Ok this looks semi-sensible. We will now save this dataset out to try and fit some transition models to the habitat preference by diversification rate bin.
Chapter 13 of Luke Harmon’s book on phylogenetic comparative methods on “Characters and diversification rates”.
Documentation of the R package ggtree used for plotting phylogenies.
Documentation for bamm - Bayesian Analysis of Macroevolutionary Mixtures.
Chapter 3 of Liam Revell and Luke Harmon’s new book on Phylogenetic Comparative Methods in R.
Chapter 6 of Natalie Cooper’s book cover phylogenetic generalised least squares regression in R using caper.
Source Code
---title: "Exploring rates of speciation in the tree using BAMM"author: "Daniel Padfield"date: last-modifiedformat: html: toc: true toc-depth: 2 toc-title: 'Contents' code-overflow: wrap code-fold: true code-tools: true self-contained: true self-contained-math: trueexecute: message: false warning: false fig.align: 'center'editor: visual---## OutlineThis walk-through goes through the analysis of a BAMM run on our phylogenetic tree, which is short for Bayesian Analysis of Macroevolutionary Mixtures. This model identifies changes in speciation, extinction, and diversification rates on a tree without any knowledge of tip states or traits, and identifies the most likely number of transitions, where they are likely to occur, and outputs a bunch of other useful things.From the online documentation of BAMM, we are following [Section 8](http://bamm-project.org/postprocess.html#bammtools): Analysing BAMM Output with BAMMtools. We have previously ran **bamm** on our tree. *Input information about that run here.*## Results- This uses the ultrametric tree as re-estimated using a strict molecular clock model using **phangorn**.- ## Load in R packagesFirst we will load in R packages used and the metadata file used and wrangled in a previous walk-through.```{r load_packages}#| results: falseset.seed(42)# load packageslibrary(here)library(caper)library(ggtree)library(ggnewscale)library(RColorBrewer)library(patchwork)library(ape)library(phytools)library(BAMMtools)library(coda)library(MetBrewer)library(fastdivrate) # remotes::install_github("jonchang/fastdivrate")library(nlme)library(emmeans)library(tidyverse)# set where I am in the projecthere::i_am('scripts/sequencing_rpoB/analyses/phylogenetics/post_bamm_analysis.qmd')# read in habitat preferenced_habpref <-read.csv(here('data/sequencing_rpoB/phyloseq/myxococcus/habitat_preference/summary/habitat_preference_asv.csv'))# read in phyloseq object and grab tax tabled_taxa <-readRDS(here('data/sequencing_rpoB/phyloseq/myxococcus/prevalence_filtered/ps_otu_asv_filt.rds')) %>% phyloseq::tax_table() %>%data.frame() %>% janitor::clean_names() %>%rownames_to_column('otu')# create d_metad_meta <-left_join(select(d_habpref, otu, habitat_preference, num_present), select(d_taxa, otu:family))```## Lineage through time plotFirst we will look at a lineage through time plot of our ultrametric phylogenetic tree.The times on the tree are not scaled between 0-1 as in the **chronopl()** tree.```{r ltt}#| fig.height: 5#| fig.width: 7# load in treetree <- ape::read.tree(here('data/sequencing_rpoB/raxml/trees/myxo_asv/myxo_asv_phangorn.tre'))# check is rootedis.rooted(tree)# check is ultrametricis.ultrametric(tree)# grab lineage through time dataape::ltt.plot(tree)# create function to scale data between 0 and 1scale_01 <-function(x){(x-min(x))/(max(x)-min(x))}# create data framed_ltt <- ape::ltt.plot.coords(tree) %>%data.frame() %>%mutate(time2 =scale_01(time))# create lineage through time plotggplot(d_ltt, aes(time2, N)) +geom_line() +theme_bw(base_size =14) +labs(x ='Relative time',y ='Number of lineages')```## Assess convergence of BAMM runFirst we will assess convergence of our MCMC simulation.```{r bamm_check_convergence}# read in mcmc output from bammmcmcout <-read.csv(here("data/sequencing_rpoB/bamm/bamm_asv_phangorn_mcmc_out.txt"), header=TRUE)max(mcmcout$generation)# discard some runs as burnin. We will discard the first 10% of samplesburnstart <-floor(0.3*nrow(mcmcout))postburn <- mcmcout[burnstart:nrow(mcmcout), ]# calculate effective sample sizeeffectiveSize(postburn$N_shifts)effectiveSize(postburn$logLik)```In general, we want the effective sample size values to be at least 200 (and 200 is on the low side, but might be reasonable for very large datasets). Both are now well above 100 with a 30% burnin. We can see that the run did not quite finish (on 25,000,000) iterations not 30,000,000, but the results should still be interpretable from experience.Next we can look at the number of potential rate shifts.```{r check_rate_shifts}#| fig.height: 4#| fig.width: 6post_probs <-table(postburn$N_shifts) /nrow(postburn)names(post_probs)edata <-getEventData(tree, eventdata =here('data/sequencing_rpoB/bamm/bamm_asv_phangorn_event_data.txt'), burnin =0.3)shift_probs <-summary(edata)# plot theseggplot(shift_probs, aes(shifts, prob)) +geom_col(col ='black', fill ='light grey') +theme_bw(base_size =14) +labs(x ='Number of shifts',y ='Probability')# calculate 95% CIs for the number of shiftsn_shifts_ci <-tibble(mean_shifts =mean(postburn$N_shifts),lower_ci =quantile(postburn$N_shifts, 0.025),upper_ci =quantile(postburn$N_shifts, 0.975))n_shifts_ci```The model has converged, but it find loads more shifts, anything from 41 to 96, with an average of 65!.The maintainers or BAMM suggest that (usually) the best overall model from a BAMM analysis is the model with the highest Bayes factor relative to the null model, $M_{0}$, which has zero rate shifts. However, we do not have any samples of zero shifts in our postburn in sample, instead the minimum number of shifts is `r min(postburn$N_shifts)`! However, we do have zero shifts in our preburn-in samples as can be seen here: `r min(mcmcout$N_shifts)`.We can therefore calculate Bayes factors from the `mcmc_out.txt` file. We are not going to have a burnin because otherwise we cannot sample the example of zero shifts.Bayes factors greater than 20 generally imply strong evidence for one model over another; values greater than 50 are very strong evidence in favour of the numerator model. There is no definitive Bayes factor criterion for "significance", but many researchers consider values greater than 12 to be consistent with at least some effect.```{r bamm_bayes_factors}# list filemcmc_file =here("data/sequencing_rpoB/bamm/bamm_asv_phangorn_mcmc_out.txt")# calculate Bayes Factorsbayes_factors <-computeBayesFactors(mcmc_file, expectedNumberOfShifts=500, burnin=0)# grab the columns for pairwise comparisons between 0 shifts and number of shiftsd_bayes_factors <- bayes_factors[,1] %>%data.frame() %>%rownames_to_column(var ='n_shifts') %>%rename(., bayes_factor =`.`)# we can rank bayes factors and then find the the difference between thesed_bayes_factors <-arrange(d_bayes_factors, desc(bayes_factor)) %>%mutate(diff =c(0, abs(diff(bayes_factor))),cum_diff =cumsum(diff))head(d_bayes_factors)```We can see the best model with 59 shifts is ranked the best but there are lots of other models that have similar levels of support. We can look at the number of shifts predicted from models within 20 of the best model with the best bayes factor.```{r best_model_bayes}filter(d_bayes_factors, cum_diff <20) %>%pull(n_shifts) %>%as.numeric() %>%summary()```There are 26 models within 20 bayes factors, with an average number of shifts of 62, a min of 48 and mac of 79.**BAMMtools** also has a function for visualising the prior and posterior simultaneously. This is useful to see what models are not being sampled in the posterior, and also to evaluate how far from the prior the posterior has moved.```{r plot_prior_posterior}#| fig.height: 5#| fig.width: 7# use plotPrior to visualise the prior and posterior simultaneouslyd_prior <-plotPrior(mcmcout, expectedNumberOfShifts=500, burnin =0.3) %>%data.frame() %>% janitor::clean_names() %>%pivot_longer(cols =contains('probs'), names_to ='type', values_to ='prob', names_pattern ="(.*)_probs")ggplot(d_prior, aes(n_shifts, prob, fill = type)) +geom_bar(col ='black', stat ='identity', position=position_dodge()) +theme_bw(base_size =14) +scale_fill_manual('', values =c('grey', 'black'), labels =c('posterior', 'prior')) +labs(x ='Number of shifts',y ='Probability') +theme(legend.position =c(0.2, 0.8))```We can see that our posterior distribution has shifted from the prior which further reinforces our conclusion that the model has converged.## Plotting results from bamm modelWe can now try and plot the mean, model-average diversification rates at any point along every branch of a phylogenetic tree. The standard code to do this takes so long to run that right we will not evaluate the code.```{r plot_diversification_rates}#| eval: falseplot.bammdata(edata, lwd=2)```We can calculate the credible shift set that is the set of distinct shift configurations that account for 95% of the probability of the data. Core shifts are those that contribute appreciably to your ability to model the data. Non-core shifts are simply ephemeral shifts that don't really contribute anything: they are simply what you expect under the prior distribution for rate shifts across the tree.```{r credible_shift_sets}#| fig.height: 7#| fig.width: 8# calculate credible shift setd_css <-credibleShiftSet(edata, expectedNumberOfShifts =500, threshold =5, set.limit =0.95)# number of distinct configurations in the datad_css$number.distinct# view more information about the credible setsummary(d_css)# can plot this credible shift setplot.credibleshiftset(d_css)```However, as can be seen from the summary, even the single best shift configuration has a really low posterior probability (0.0023). Although the best model supports 60 shifts, not all of these are core shifts. There are only 14 core shifts identified in the best model shift configurations here.For some datasets with large numbers of taxa and rate shifts (e.g., trees with thousands of taxa), all shift configurations may have low probability. There are simply too many parameters in the model to allow a single shift configuration to dominate the credible set. An alternative approach is to extract the shift configuration that maximises the marginal probability of rate shifts along individual branches. This is very similar to the idea of a maximum clade credibility tree in phylogenetic analysis. BAMM has a function `maximumShiftCredibility` for extracting this shift configuration:```{r max_clade_cred}#| fig.height: 6#| fig.width: 7# calculate max shift credibilitymsc_set <-maximumShiftCredibility(edata, maximize='product')# grab the best configuration and plot itmsc_config <-subsetEventData(edata, index = msc_set$sampleindex)plot.bammdata(msc_config, lwd=2)addBAMMshifts(msc_config, cex =2)```This picks a model with `r length(getShiftNodesFromIndex(edata, index = msc_set$sampleindex))` rate shifts over the whole tree.We can try and plot these shifts using **ggtree**. First we will look at the distribution of speciation rates across the tree. We need to be careful about our colour scale to prevent it being misleading.Here we also change the tip labels of our tree so they rematch with those from our metadata.```{r plot_ggtree_prep}#| fig.height: 5#| fig.width: 12# get mean phylorates that underly the colorised plot produced by plot.bammdata# from here: https://groups.google.com/g/bamm-project/c/W6s38xzm6OU/m/LALF47xVS54J#mbt <- getMeanBranchLengthTree(edata, rate = "speciation")# get the mean branch lengths from the best tree configuration as identified from maximumShiftCredibility mbt2 <-getMeanBranchLengthTree(msc_config, rate ="ndr")# get shift nodes from "best model"shiftnodes <-getShiftNodesFromIndex(edata, index = msc_set$sampleindex)# get treetree_bamm <- mbt2$phy# get the edge lengths in a dataframed_tree_bamm <-data.frame(tree_bamm$edge, edge_num=1:nrow(tree_bamm$edge), edge_length = tree_bamm$edge.length)colnames(d_tree_bamm)=c("parent", "node", "edge_num", 'edge_length')# transform these to log for the colour scaled_tree_bamm <-mutate(d_tree_bamm, log_edge_length =log(edge_length))# visualise edge lengths that represent speciation rates of each branch# similar to visualising colour breaks http://bamm-project.org/colorbreaks.html#how-do-i-plot-these-histogramsp1 <-ggplot(d_tree_bamm, aes(edge_length)) +geom_histogram(col ='black', fill ='light grey') +theme_bw()p2 <-ggplot(d_tree_bamm, aes(log(edge_length))) +geom_histogram(col ='black', fill ='light grey') +theme_bw()p1 + p2```We will use the log edge lengths/rates of diversification because they result in a more even spread of diversification rates across a range of values. However it is obviously extremely important to realise this emphasises rate variation in colour changes.Now we can plot the tree. We will try plot the family assignment around the edge of the tree. We also need to assign the colours the same as before.```{r plot_ggtree}#| fig.height: 12#| fig.width: 12# constrained familiesconstrained_families <-c('Myxococcaceae', 'Vulgatibacteraceae', 'Anaeromyxobacteraceae', 'Polyangiaceae', 'Sandaracinaceae', 'Nannocystaceae', 'Haliangiaceae')# reorder d_meta so the tip labesl link to the order of the tips in the treed_meta <-tibble(tip_label = tree_bamm$tip.label) %>%left_join(., rename(d_meta, tip_label = otu))# find the mrca of each of the constrained familiesd_meta2 <-filter(d_meta, family %in% constrained_families) %>% dplyr::select(family, tip_label) %>%group_by(family) %>%nest() %>%mutate(mrca =NA)for(i in1:nrow(d_meta2)){ d_meta2$mrca[i] <-findMRCA(tree_bamm, tips = d_meta2$data[[i]]$tip_label)}d_meta2 <- dplyr::select(d_meta2, family2 = family, mrca) %>%mutate(blank_label ='')# add colour for the different familiescols <-c(colorRampPalette(brewer.pal(11, "Spectral"))(nrow(d_meta2)))names(cols) <-sort(d_meta2$family2)# define colours for the different habitatscols_hab <-met.brewer('Austria', n =7)names(cols_hab) <-c('marine_mud', 'freshwater', 'terrestrial', 'freshwater:terrestrial', 'generalist', 'marine_mud:terrestrial', 'freshwater:marine_mud')hab_labels <-c('marine mud', 'freshwater', 'terrestrial', 'freshwater + terrestrial', 'generalist', 'marine mud + terrestrial', 'freshwater + marine mud')# remove any tip labels not in our tree from d_metad_meta <-filter(d_meta, tip_label %in% tree_bamm$tip.label)# make a separate column for the rare states to make their size bigger!d_meta <-mutate(d_meta, rare =ifelse(habitat_preference %in%c('generalist', 'marine_mud:terrestrial'), 'rare', 'common'))# plot tree using ggtree# first colour branches and add rate shiftsp1 <-ggtree(tree_bamm, layout ='circular', branch.length ='none', aes(col = log_edge_length)) %<+% d_tree_bamm +scale_color_gradientn('Net diversification (branch colours)', colors =met.brewer(name='Hiroshige', direction=-1, override.order = F), breaks=c(min(d_tree_bamm$log_edge_length, na.rm =TRUE) +abs(min(d_tree_bamm$log_edge_length, na.rm =TRUE))*0.2, max(d_tree_bamm$log_edge_length, na.rm =TRUE) *0.95), labels=c("Slow","Fast")) +geom_point2(aes(subset=(node %in% shiftnodes)), color="black",size=5)# next add tip pointsp2 <- p1 %<+% d_meta +new_scale_color() +geom_tippoint(aes(x=x+5, col = habitat_preference, size = rare), position =position_jitter(width =3, height =0)) +scale_color_manual('Habitat preference (tip points)', values = cols_hab, labels = hab_labels) +scale_size_manual(values =c(0.6, 2)) +guides(color =guide_legend(override.aes =list(size =5)),size ='none')p2 +new_scale_colour() +geom_cladelab(data = d_meta2,mapping =aes(node = mrca,color = family2,label = blank_label),offset =10,barsize =2) +scale_color_manual('Family (outer bar)', values = cols) +guides(color =guide_legend(override.aes =list(size =5)))ggsave(here('plots/sequencing_rpoB/analyses/bamm_tree_phangorn.png'), last_plot(), height =9, width =12)```## Rate variation through timeWe can look at diversification rates change through time to see if diversification in general is faster or slower deeper in the tree, and therefore further back in evolutionary time.We can also grab the rates through time for all the sub-trees after a rate shift to see how they compare. To do this we need to write our own function to process the data and get it ready for plotting.```{r get_rate_through_time}#| fig.height: 5#| fig.width: 17# write function to get rate through time into the correct formatget_rate_through_time_df <-function(ephy, ...){ rtt <-getRateThroughTimeMatrix(ephy, ...)# get dataframe of each part# first speciation rtt_sp <- rtt$lambda %>%data.frame() %>%mutate(sample =1:n()) %>%pivot_longer(starts_with('X'), names_to ='time_point', values_to ='speciation') %>%mutate(time_point =parse_number(time_point)) %>%group_by(sample) %>%mutate(time =unname(rtt$times)) %>%ungroup()# second extinction rtt_ex <- rtt$mu %>%data.frame() %>%mutate(sample =1:n()) %>%pivot_longer(starts_with('X'), names_to ='time_point', values_to ='extinction') %>%mutate(time_point =parse_number(time_point)) %>%group_by(sample) %>%mutate(time =unname(rtt$times)) %>%ungroup() rtt_comb <-left_join(rtt_sp, rtt_ex) rtt_comb <-mutate(rtt_comb, net_div = speciation - extinction) %>%pivot_longer(cols =c(speciation, extinction, net_div), names_to ='process', values_to ='rate')return(rtt_comb)}# get rate through time estimates for the whole treertt_all <-get_rate_through_time_df(ephy = edata)# create meansrtt_combine_means <-group_by(rtt_all, time_point, process, time) %>%summarise(ave_rate =mean(rate), .groups ='drop',lower_ci =quantile(rate, 0.025),upper_ci =quantile(rate, 0.975))# get rate through time estimates for each shift nodertt_shift <-tibble(shift_node = shiftnodes,n =1:length(shiftnodes)) %>%nest(data = shift_node) %>%mutate(temp = purrr::map(data, possibly(~get_rate_through_time_df(ephy = edata, node = .x$shift_node, nodetype ='include'), otherwise =NA_real_)),is_tib = purrr::map_dbl(temp, is_tibble))rtt_shift_nowork <-filter(rtt_shift, is_tib ==0) %>% dplyr::select(data) %>%unnest(data)rtt_shift2 <-filter(rtt_shift, is_tib ==1) %>% dplyr::select(data, temp) %>%unnest(data) %>%unnest(temp)# create meansrtt_shift_means <-group_by(rtt_shift2, time_point, process, time, shift_node) %>%summarise(ave_rate =mean(rate), .groups ='drop',lower_ci =quantile(rate, 0.025),upper_ci =quantile(rate, 0.975))# create a plotggplot() +geom_line(aes(time, ave_rate, group = shift_node), rtt_shift_means, col ='dark grey') +geom_ribbon(aes(time, ymin = lower_ci, ymax = upper_ci, group = shift_node), alpha =0.01, rtt_shift_means, fill ='dark grey') +geom_line(aes(time, ave_rate), rtt_combine_means) +geom_ribbon(aes(time, ymin = lower_ci, ymax = upper_ci), alpha =0.1, rtt_combine_means) +facet_wrap(~process, scales ='free') +theme_bw(base_size =14) +labs(x ='Relative time',y ='rate')# save plot outggsave(here('plots/sequencing_rpoB/analyses/bamm_rate_through_time_phangorn.png'), last_plot(), height =5, width =12)```Compared to the penalized likelihood smoothing model, we can see very drastic declines in net diversification driven by rapid decreases in speciation. Generally the rate shifts are not as deep in tree and result in static speciation rate through time.You can see that after a shift node there is generally a rapid acceleration in diversification and speciation rate, after which we see a decline, whereas extinction rate rises quickly and then plateaus through time. The uncertainty around these node specific estimates is very high, but the general pattern appears to hold.## Looking at tip-specific evolutionary ratesWe can also estimate tip-specific evolutionary rates. We can plot these across our different traits to see if there is anything particularly interesting going on. This could be complemented by our models looking at state-dependent character diversification rates.The first method is to grab the tip-specific rate from the BAMM model.```{r tip_rates}#| fig.height: 5#| fig.width: 17# grab out tip rates tip_rates <-data.frame(tip_label = edata$tip.label,speciation = edata$meanTipLambda,extinction = edata$meanTipMu) %>%left_join(., dplyr::select(d_meta, tip_label, habitat_preference)) %>%filter(!is.na(habitat_preference)) %>%group_by(habitat_preference) %>%mutate(n =n()) %>%ungroup() %>%mutate(hab_pref_axis =gsub(':', '/ ', habitat_preference),hab_pref_axis =gsub('_', ' ', hab_pref_axis),net_diversification = speciation - extinction)# check numbers are rightgroup_by(tip_rates, habitat_preference) %>%tally() %>%arrange(n)p1 <-ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), speciation)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =12) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='Speciation rate')p2 <-ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), extinction)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =12) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='Extinction rate')p3 <-ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), net_diversification)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =12) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='Net diversification rate')p1 + p2 + p3# save plot outggsave(here('plots/sequencing_rpoB/analyses/bamm_tip_rates_phangorn.png'), last_plot(), height =5, width =17)```There does not appear to be much variation between traits in their speciation and extinction rates. If we wanted to follow this further there are lots of papers we could look at. First and foremost would be to read Title & Rabosky's paper entitled ["Tip rates, phylogenies and diversification: What are we estimating, and how good are the estimates?"](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13153). They also share other papers that have used tip rate estimates to look at diversification rates across geographical and environmental gradients and across different traits.A second method would be to calculate the DR statistic (also known as tip DR) which is an tip-level estimate of the speciation rate separate from the BAMM model. This measure is non-model based, and incorporates the number of splitting events and the internode distances along the root-to-tip path of a phylogeny, while giving greater weight to branches closer to the present. This was first implemented by Jetz *et al.* (2012) in their Nature paper about bird diversity in space and time.```{r tip_dr}#| fig.height: 5#| fig.width: 7#| warning: false# compute tip DRd_tipdr <-DR_statistic_C(tree)# put this into a dataframed_tipdr <-data.frame(tip_label =names(d_tipdr),tip_dr =unname(d_tipdr)) %>%separate(., tip_label, c('part1', 'part2', 'part3'), sep ='_') %>%unite('tip_label', c(part1, part2), sep ='_') %>% dplyr::select(-part3) %>%left_join(., dplyr::select(d_meta, tip_label, habitat_preference)) %>%filter(!is.na(habitat_preference)) %>%group_by(habitat_preference) %>%mutate(n =n()) %>%ungroup() %>%mutate(hab_pref_axis =gsub(':', '/ ', habitat_preference),hab_pref_axis =gsub('_', ' ', hab_pref_axis))ggplot(d_tipdr, aes(forcats::fct_reorder(hab_pref_axis, n), tip_dr)) + MicrobioUoE::geom_pretty_boxplot(col='black', fill ='black') +geom_point(shape =21, fill ='white', position =position_jitter(width =0.2)) +theme_bw(base_size =14) +scale_x_discrete(labels = scales::label_wrap(13)) +labs(x ='Habitat preference',y ='tip DR (speciation rate)')```This method also shows very little variation in the estimated tip-level speciation rate between different habitat preferences.This again is something to discuss further with Rutger as I am not sure exactly where to go with the remainder of this analysis. These sorts of statistics may be very useful where fitting of SSE models with rate changes within the state are not possible. This may be something to consider with our dataset where the number of parameters in the MuHiSSE would be so big.We can can do phylogenetic generalised least squares regressions to account for phylogenteic relatedness concerning differences in diversification rate between species with different habitat preferences. This approach has been used by Jetz *et al.* (2012) in their Nature paper about bird diversity in space and time.We will first do this using **nlme::lme()** as described in Liam Revell's and Luke Harmon's book "Phylogenetic Comparative Methods in R".```{r pgls_nlme_run}#| eval: false# set up correlation matrix for the treecor_lambda <-corPagel(value =1, phy = tree, form =~tip_label)# fit phylogenetic generalised linear modelmod <-gls(tip_dr ~ habitat_preference, data = d_tipdr, correlation = cor_lambda)summary(mod)# do contrasts between habitat preferencescontrasts <-emmeans(mod, pairwise ~ habitat_preference)contrasts$contrasts```So the model fits and is ok. The estimate of Pagel's lambda - the phylogenetic signal - is 0.79 which is quite high. There are no significant differences between any of the habitat preferences in terms of their speciation rate though.## Defining each tip as either high diversification or low diversificationOne of the main questions we are interested in is whether we see a increase in the diversification rate AFTER a transition, but this is a difficult thing to test using either the BAMM model or state-dependent speciation and extinction models.One alternative might be to create a new variable of high and low diversification rate for each tip, and then combine this with habitat preference when we explore transition rate models.We can look at the distribution of tip rates, irregardless of habitat preference.```{r tip_rate_dist}#| fig.height: 5#| fig.width: 7ggplot(tip_rates, aes(net_diversification)) +geom_histogram(fill ='white', col ='black') +theme_bw()```There is a hint that this model is bimodal. We can run a k-means clustering on it to try and split it into two groups.```{r split_tip_rate}#| fig.height: 5#| fig.width: 7# do very crude kmeans clusteringmod_clust <-kmeans(tip_rates$net_diversification, 2, algo="Lloyd")# add this to our datasettip_rates <-mutate(tip_rates, cluster =as.character(mod_clust$cluster),div_rate =ifelse(cluster =='1', 'low', 'high'))ggplot(tip_rates, aes(net_diversification, fill = div_rate)) +geom_histogram(col ='black') +theme_bw(base_size =14) +theme(legend.position =c(0.8, 0.8))```We can count the number of each combination of diversification rate (high or low) and each habitat preference to see whether we have some common and rare states.```{r check_numbers}# check number in each diversity rate bin: high and lowgroup_by(tip_rates, div_rate) %>%tally()# check per diversity rate bin/habitat preference combinationgroup_by(tip_rates, habitat_preference, div_rate) %>%tally() %>%group_by(habitat_preference) %>%mutate(prop =round(n/sum(n), 2)) %>%ungroup()```Ok this looks semi-sensible. We will now save this dataset out to try and fit some transition models to the habitat preference by diversification rate bin.```{r save_out}#| eval: falsesaveRDS(dplyr::select(tip_rates, tip_label, div_rate), here('data/sequencing_rpoB/processed/div_rate_bins.rds'))```## Useful links used- [Chapter 13](https://lukejharmon.github.io/pcm/chapter13_chardiv/#ref-Beaulieu2016-ww) of Luke Harmon's book on phylogenetic comparative methods on "Characters and diversification rates".- [Documentation](https://yulab-smu.top/treedata-book/) of the R package **ggtree** used for plotting phylogenies.- [Documentation](http://bamm-project.org/introduction.html#bayesian-analysis-of-macroevolutionary-mixtures) for bamm - Bayesian Analysis of Macroevolutionary Mixtures.- Chapter 3 of Liam Revell and Luke Harmon's new book on Phylogenetic Comparative Methods in R.- [Chapter 6](https://nhcooper123.github.io/pcm-primer-online/phylogenetic-generalised-least-squares-pgls-in-r.html) of Natalie Cooper's book cover phylogenetic generalised least squares regression in R using caper.